48 research outputs found

    Stratification bias in low signal microarray studies

    BACKGROUND: When analysing microarray and other small sample size biological datasets, care is needed to avoid various biases. We analyse a form of bias, stratification bias, that can substantially affect analyses using sample-reuse validation techniques and lead to inaccurate results. This bias is due to imperfect stratification of samples in the training and test sets and the dependency between these stratification errors, i.e. the variations in class proportions in the training and test sets are negatively correlated. RESULTS: We show that when estimating the performance of classifiers on low signal datasets (i.e. those which are difficult to classify), which are typical of many prognostic microarray studies, commonly used performance measures can suffer from a substantial negative bias. For error rate this bias is only severe in quite restricted situations, but can be much larger and more frequent when using ranking measures such as the receiver operating characteristic (ROC) curve and area under the ROC (AUC). Substantial biases are shown in simulations and on the van 't Veer breast cancer dataset. The classification error rate can have large negative biases for balanced datasets, whereas the AUC shows substantial pessimistic biases even for imbalanced datasets. In simulation studies using 10-fold cross-validation, AUC values of less than 0.3 can be observed on random datasets rather than the expected 0.5. Further experiments on the van 't Veer breast cancer dataset show these biases exist in practice. CONCLUSION: Stratification bias can substantially affect several performance measures. In computing the AUC, the strategy of pooling the test samples from the various folds of cross-validation can lead to large biases; computing it as the average of per-fold estimates avoids this bias and is thus the recommended approach. 
As a more general solution applicable to other performance measures, we show that stratified repeated holdout and modified versions of k-fold cross-validation (balanced, stratified cross-validation and balanced leave-one-out cross-validation) avoid the bias. Therefore, for model selection and evaluation on microarray and other small biological datasets, these methods should be used and unstratified versions avoided. In particular, the commonly used (unbalanced) leave-one-out cross-validation should not be used to estimate AUC for small datasets.
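The mechanism described in this abstract can be illustrated with a minimal sketch. It assumes a deliberately signal-free scorer (hypothetical, not the classifiers used in the paper) that assigns every test sample the positive-class fraction of its training set. Under unbalanced leave-one-out, holding out a positive sample lowers the training fraction and holding out a negative raises it, so pooling the test scores ranks every positive below every negative. The fold layouts and helper names below are illustrative assumptions.

```python
def auc(scores, labels):
    """Mann-Whitney AUC; a tie between a positive and a negative counts as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pooled_cv_auc(labels, folds):
    """Pool the test scores from all folds, then compute a single AUC."""
    scores = [0.0] * len(labels)
    for test_idx in folds:
        held_out = set(test_idx)
        train = [labels[i] for i in range(len(labels)) if i not in held_out]
        # Assumed no-signal scorer: the positive fraction of the training set.
        score = sum(train) / len(train)
        for i in test_idx:
            scores[i] = score
    return auc(scores, labels)


labels = [1, 0] * 10  # 20 samples, balanced classes, no features at all

# Unbalanced leave-one-out: each fold holds out a single sample.
loo_folds = [[i] for i in range(len(labels))]

# Stratified folds: each fold holds out one positive and one negative.
stratified_folds = [[2 * k, 2 * k + 1] for k in range(10)]

loo_auc = pooled_cv_auc(labels, loo_folds)
stratified_auc = pooled_cv_auc(labels, stratified_folds)
print("pooled AUC, leave-one-out:", loo_auc)
print("pooled AUC, stratified folds:", stratified_auc)
```

With stratified folds the training-set class proportions are identical across folds, so the scores tie and the pooled AUC stays at chance level; with leave-one-out the negative correlation between training and test proportions drives it far below chance.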

    Simple SVM based whole-genome segmentation

    We present a support vector machine (SVM) based framework for segmenting DNA into binary classes. Two applications are explored: transcription start site prediction and transcription factor binding prediction. Experiments demonstrate that our approach performs significantly better than other methods on both tasks.

    Precision-mapping and statistical validation of quantitative trait loci by machine learning

    BACKGROUND: We introduce a QTL-mapping algorithm based on Statistical Machine Learning (SML) that is conceptually quite different to existing methods as there is a strong focus on generalisation ability. Our approach combines ridge regression, recursive feature elimination, and estimation of generalisation performance and marker effects using bootstrap resampling. Model performance and marker effects are determined using independent testing samples (individuals), thus providing better estimates. We compare the performance of SML against Composite Interval Mapping (CIM), Bayesian Interval Mapping (BIM) and single Marker Regression (MR) on synthetic datasets and a multi-trait and multi-environment dataset of the progeny for a cross between two barley cultivars. RESULTS: In an analysis of the synthetic datasets, SML accurately predicted the number of QTL underlying a trait while BIM tended to underestimate the number of QTL. The QTL identified by SML for the barley dataset broadly coincided with known QTL locations. SML reported approximately half of the QTL reported by either CIM or MR, not unexpected given that neither CIM nor MR incorporates independent testing. The latter makes these two methods susceptible to producing overly optimistic estimates of QTL effects, as we demonstrate for MR. The QTL resolution (peak definition) afforded by SML was consistently superior to MR, CIM and BIM, with QTL detection power similar to BIM. The precision of SML was underscored by repeatedly identifying, at ≤ 1-cM precision, three QTL for four partially related traits (heading date, plant height, lodging and yield). The set of QTL obtained using a 'raw' and a 'curated' version of the same genotypic dataset were more similar to each other for SML than for CIM or MR. CONCLUSION: The SML algorithm produces better estimates of QTL effects because it eliminates the optimistic bias in the predictive performance of other QTL methods. It produces narrower peaks than other methods (except BIM) and hence identifies QTL with greater precision. It is more robust to genotyping and linkage mapping errors, and identifies markers linked to QTL in the absence of a genetic map.
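The combination of ridge regression, recursive feature elimination and bootstrap resampling named in this abstract can be sketched roughly as follows. This is not the authors' SML implementation: the synthetic data, the fixed penalty `lam`, the absence of an intercept, the number of markers kept, and all function names are assumptions made for illustration. Each bootstrap replicate fits on the in-bag individuals and scores the model on the out-of-bag individuals, which plays the role of the independent testing samples.

```python
import random


def ridge_fit(X, y, lam):
    """Solve (X^T X + lam*I) w = X^T y by Gauss-Jordan elimination (no intercept)."""
    p = len(X[0])
    # Augmented matrix [X^T X + lam*I | X^T y].
    A = [[sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(X, y))] for i in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]


def rfe(X, y, n_keep, lam):
    """Recursive feature elimination: refit, then drop the marker with the smallest |weight|."""
    active = list(range(len(X[0])))
    while len(active) > n_keep:
        w = ridge_fit([[r[j] for j in active] for r in X], y, lam)
        weakest = min(range(len(active)), key=lambda k: abs(w[k]))
        del active[weakest]
    w = ridge_fit([[r[j] for j in active] for r in X], y, lam)
    return active, w


def bootstrap_rfe(X, y, n_keep=2, lam=1.0, n_boot=20, seed=0):
    """Fit on each bootstrap sample; measure error on the out-of-bag individuals."""
    rng = random.Random(seed)
    n = len(X)
    results = []
    for _ in range(n_boot):
        bag = [rng.randrange(n) for _ in range(n)]
        oob = sorted(set(range(n)) - set(bag))
        if not oob:
            continue  # no independent individuals left to test on
        active, w = rfe([X[i] for i in bag], [y[i] for i in bag], n_keep, lam)
        mse = sum((y[i] - sum(wk * X[i][j] for wk, j in zip(w, active))) ** 2
                  for i in oob) / len(oob)
        results.append((active, mse))
    return results


# Hypothetical data: 60 individuals, 6 markers, trait driven by markers 0 and 3.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(6)] for _ in range(60)]
y = [2.0 * r[0] - 3.0 * r[3] + data_rng.gauss(0, 0.1) for r in X]

results = bootstrap_rfe(X, y)
for active, mse in results[:3]:
    print("selected markers:", active, "out-of-bag MSE:", round(mse, 4))
```

Counting how often each marker survives elimination across replicates gives a simple stability measure, and the out-of-bag error avoids scoring a model on the individuals it was fitted to.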

    A distinct DNA methylation signature defines pediatric pre-B cell acute lymphoblastic leukemia

    Pre-B cell acute lymphoblastic leukemia (ALL) is the most prevalent childhood malignancy and remains one of the leading causes of childhood mortality. Despite this, the mechanisms leading to disease remain poorly understood. We asked whether recurrent aberrant DNA methylation plays a role in childhood ALL and have defined a genome-scale DNA methylation profile associated with the ETV6-RUNX1 subtype of pediatric ALL. Archival bone marrow smears from 19 children, collected at diagnosis and remission, were used to derive a disease-specific DNA methylation profile. The gene signature was confirmed in an independent cohort of 86 patients. A further 163 patients were analyzed for DNA methylation of a three-gene signature. We found that the DNA methylation signature at diagnosis was distinct from that at remission. Fifteen loci were sufficient to discriminate leukemia from disease-free samples and purified CD34+ cells. DNA methylation of these loci was recurrent irrespective of cytogenetic subtype of pre-B cell ALL. We show that recurrent aberrant genomic methylation is a common feature of pre-B ALL, suggesting a shared pathway for disease development. By revealing new DNA methylation markers associated with disease, this study has identified putative targets for development of novel epigenetic-based therapies.

    A blood-based predictor for neocortical Aβ burden in Alzheimer's disease: results from the AIBL study

    Dementia is a global epidemic with Alzheimer’s disease (AD) being the leading cause. Early identification of patients at risk of developing AD is now becoming an international priority. Neocortical Aβ (extracellular β-amyloid) burden (NAB), as assessed by positron emission tomography (PET), represents one such marker for early identification. These scans are expensive and are not widely available, thus, there is a need for cheaper and more widely accessible alternatives. Addressing this need, a blood biomarker-based signature having efficacy for the prediction of NAB and which can be easily adapted for population screening is described. Blood data (176 analytes measured in plasma) and Pittsburgh Compound B (PiB)-PET measurements from 273 participants from the Australian Imaging, Biomarkers and Lifestyle (AIBL) study were utilised. Univariate analysis was conducted to assess the difference of plasma measures between high and low NAB groups, and cross-validated machine-learning models were generated for predicting NAB. These models were applied to 817 non-imaged AIBL subjects and 82 subjects from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) for validation. Five analytes showed significant difference between subjects with high compared to low NAB. A machine-learning model (based on nine markers) achieved sensitivity and specificity of 80 and 82%, respectively, for predicting NAB. Validation using the ADNI cohort yielded similar results (sensitivity 79% and specificity 76%). These results show that a panel of blood-based biomarkers is able to accurately predict NAB, supporting the hypothesis for a relationship between a blood-based signature and Aβ accumulation, therefore, providing a platform for developing a population-based scree

    Epithelial-to-mesenchymal transition supports ovarian carcinosarcoma tumorigenesis and confers sensitivity to microtubule-targeting with eribulin

    Ovarian carcinosarcoma (OCS) is an aggressive and rare tumour type with limited treatment options. OCS is hypothesised to develop via the combination theory, with a single progenitor resulting in carcinomatous and sarcomatous components, or alternatively via the conversion theory, with the sarcomatous component developing from the carcinomatous component through epithelial-to-mesenchymal transition (EMT). In this study, we analysed DNA variants from isolated carcinoma and sarcoma components to show that OCS from 18 women is monoclonal. RNA sequencing indicated the carcinoma components were more mesenchymal when compared with pure epithelial ovarian carcinomas, supporting the conversion theory and suggesting that EMT is important in the formation of these tumours. Preclinical OCS models were used to test the efficacy of microtubule-targeting drugs, including eribulin, which has previously been shown to reverse EMT characteristics in breast cancers and induce differentiation in sarcomas. Vinorelbine and eribulin more effectively inhibited OCS growth than standard-of-care platinum-based chemotherapy, and treatment with eribulin reduced mesenchymal characteristics and N-MYC expression in OCS patient-derived xenografts (PDX). Eribulin treatment resulted in an accumulation of intracellular cholesterol in OCS cells, which triggered a down-regulation of the mevalonate pathway and prevented further cholesterol biosynthesis. Finally, eribulin increased expression of genes related to immune activation and increased the intratumoral accumulation of CD8+ T cells, supporting exploration of immunotherapy combinations in the clinic. Together, these data indicate EMT plays a key role in OCS tumourigenesis and support the conversion theory for OCS histogenesis. Targeting EMT using eribulin could help improve OCS patient outcomes

    Microarray design using the Hilbert-Schmidt independence criterion

    This paper explores the design problem of selecting a small subset of clones from a large pool for creation of a microarray plate. A new kernel-based unsupervised feature selection method using the Hilbert-Schmidt independence criterion (HSIC) is presented and evaluated on three microarray datasets: the Alon colon cancer dataset, the van 't Veer breast cancer dataset, and a multiclass cancer of unknown primary dataset. The experiments show that subsets selected by the HSIC resulted in equivalent or better performance than supervised feature selection, with the added benefit that the subsets are not target specific.
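For readers unfamiliar with HSIC, the following is a rough sketch of the standard biased empirical estimator, tr(KHLH)/(n-1)^2, used here to score each feature by its dependence on the full data matrix. The abstract does not specify the kernel or the selection procedure, so the linear kernel, the one-feature-at-a-time ranking, the toy matrix and the function names are all assumptions for illustration rather than the paper's method.

```python
def linear_kernel(rows):
    """Gram matrix of inner products between samples (assumed kernel choice)."""
    return [[sum(a * b for a, b in zip(u, v)) for v in rows] for u in rows]


def center(K):
    """Double-centre a symmetric Gram matrix, i.e. compute H K H."""
    n = len(K)
    row_mean = [sum(r) / n for r in K]
    total_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - row_mean[j] + total_mean for j in range(n)]
            for i in range(n)]


def hsic(K, L):
    """Biased empirical HSIC: tr(K H L H) / (n - 1)^2."""
    n = len(K)
    Kc, Lc = center(K), center(L)
    return sum(Kc[i][j] * Lc[i][j] for i in range(n) for j in range(n)) / (n - 1) ** 2


def rank_features(X):
    """Score each feature by HSIC between that feature alone and the full data."""
    L = linear_kernel(X)
    scores = []
    for j in range(len(X[0])):
        column = [[row[j]] for row in X]
        scores.append(hsic(linear_kernel(column), L))
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order, scores


# Hypothetical toy matrix: 4 samples (rows) by 3 features; feature 2 is constant.
X = [[1.0, 0.0, 1.0],
     [2.0, 1.0, 1.0],
     [3.0, 0.0, 1.0],
     [4.0, 1.0, 1.0]]

order, scores = rank_features(X)
print("features ranked by HSIC with the full data:", order)
print("scores:", scores)
```

No class labels enter the computation, which is why a subset chosen this way is not tied to a particular prediction target; a constant feature carries no information about the samples and scores zero.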